Decentralized Data Fusion and Active Sensing with Mobile Sensors for 
Modeling and Predicting Spatiotemporal Traffic Phenomena 
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Abstract 

The problem of modeling and predicting spa- 
tiotemporal traffic phenomena over an urban road 
network is important to many traffic applica- 
tions such as detecting and forecasting conges- 
tion hotspots. This paper presents a decentral- 
ized data fusion and active sensing (D 2 FAS) al- 
gorithm for mobile sensors to actively explore 
the road network to gather and assimilate the 
most informative data for predicting the traffic 
phenomenon. We analyze the time and commu- 
nication complexity of D 2 FAS and demonstrate 
that it can scale well with a large number of ob- 
servations and sensors. We provide a theoret- 
ical guarantee on its predictive performance to 
be equivalent to that of a sophisticated central- 
ized sparse approximation for the Gaussian pro- 
cess (GP) model: The computation of such a 
sparse approximate GP model can thus be par- 
allelized and distributed among the mobile sen- 
sors (in a Google-like MapReduce paradigm), 
thereby achieving efficient and scalable predic- 
tion. We also theoretically guarantee its active 
sensing performance that improves under various 
practical environmental conditions. Empirical 
evaluation on real-world urban road network data 
shows that our D 2 FAS algorithm is significantly 
more time-efficient and scalable than state-of- 
the-art centralized algorithms while achieving 
comparable predictive performance. 

1 Introduction 

Knowing and understanding the traffic conditions and 
phenomena over road networks has become increas- 
ingly important to the goal of achieving smooth-flowing, 
congestion-free traffic, especially in densely-populated ur- 
ban cities. According to a 2011 urban mobility report 
(Schrank et al. 2011 1, traffic congestion in the USA has 



spatiotemporally varying traffic phenomena (e.g., speeds 
and travel times along road segments) are predicted accu- 
rately enough in real time to detect and forecast the con- 
gestion hotspots; network-level (e.g., ramp metering, road 
pricing) and user-level (e.g., route replanning) measures 
can then be taken to relieve this congestion, so as to im- 
prove the overall efficiency of road networks. 

In practice, it is non-trivial to achieve real-time, accu- 
rate prediction of a spatiotemporally varying traffic phe- 
nomenon because the quantity of sensors that can be 
deployed to observe an entire road network is cost- 
constrained. Traditionally, static sensors such as loop de- 



tectors (Krause et al. 2008a Wang and Papageorgiou 



2005 ) are placed at designated locations in a road network 



to collect data for predicting the traffic phenomenon. How- 
ever, they provide sparse coverage (i.e., many road seg- 
ments are not observed, thus leading to data sparsity), incur 
high installation and maintenance costs, and cannot repo- 
sition by themselves in response to changes in the traffic 
phenomenon. Low-cost GPS technology allows the col- 
lection of traffic data using passive mobile probes (Work 
et al. 2010[) (e.g., taxis/cabs). Unlike static sensors, they 



can directly measure the travel times along road segments. 
But, they provide fairly sparse coverage due to low GPS 
sampling frequency (i.e., often imposed by taxi/cab com- 
panies) and no control over their routes, incur high ini- 
tial implementation cost, pose privacy issues, and produce 
highly-varying speeds and travel times while traversing the 
same road segment due to inconsistent driving behaviors. 
A critical mass of probes is needed on each road segment 



to ease the severity of the last drawback (Srinivasan and 



caused 1.9 billion gallons of extra fuel, 4.8 billion hours of 
travel delay, and $101 billion of delay and fuel cost. Such 
huge resource wastage can potentially be mitigated if the 



|Jovanis| [T996 ) but is often hard to achieve on non-highway 
segments due to sparse coverage. In contrast, we propose 
the use of active mobile probes ( Turner et al. 1998| l to over- 
come the limitations of static and passive mobile probes. In 
particular, they can be directed to explore any segments of 
a road network to gather traffic data at a desired GPS sam- 
pling rate while enforcing consistent driving behavior. 

How then do the mobile probes/sensors actively explore a 
road network to gather and assimilate the most informative 



observations for predicting the traffic phenomenon? There 
are three key issues surrounding this problem, which will 
be discussed together with the related works: 

Models for predicting spatiotemporal traffic phenom- 
ena. The spatiotemporal correlation structure of a traffic 
phenomenon can be exploited to predict the traffic condi- 
tions of any unobserved road segment at any time using 
the observations taken along the sensors' paths. To achieve 



this, existing Bayesian filtering frameworks ( Chen et al. 



2011 Wang and Papageorgiou 2005 Work et al. 2010) 



utilize various handcrafted parametric models predicting 
traffic flow along a highway stretch that only correlate ad- 
jacent segments of the highway. So, their predictive perfor- 
mance will be compromised when the current observations 
are sparse and/or the actual spatial correlation spans multi- 
ple segments. Their strong Markov assumption further ex- 
acerbates this problem. It is also not shown how these mod- 
els can be generalized to work for arbitrary road network 
topologies and more complex correlation structure. Exist- 



ing multivariate parametric traffic prediction models (Ka 



marianakis and Prastacos 2003 Min and Wynter 2011 1 



do not quantify uncertainty estimates of the predictions and 
impose rigid spatial locality assumptions that do not adapt 
to the true underlying correlation structure. 

In contrast, we assume the traffic phenomenon over an ur- 
ban road network (i.e., comprising the full range of road 
types like highways, arterials, slip roads) to be realized 
from a rich class of Bayesian non-parametric models called 
the Gaussian process (GP) (Section [2]) that can formally 
characterize its spatiotemporal correlation structure and be 
refined with growing number of observations. More im- 
portantly, GP can provide formal measures of predictive 
uncertainty (e.g., based on variance or entropy criterion) 
for directing the sensors to explore highly uncertain areas 
of the road network. |Krause et a/. | ( 2008a} used GP to rep- 
resent the traffic phenomenon over a network of only high- 
ways and defined the correlation of speeds between high- 
way segments to depend only on the geodesic (i.e., shortest 
path) distance of these segments with respect to the net- 
work topology; their features are not considered. Neu- 



mann et al. (2009) maintained a mixture of two indepen- 



dent GPs for flow prediction such that the correlation struc- 
ture of one GP utilized road segment features while that 
of the other GP depended on manually specified relations 
(instead of geodesic distance) between segments with re- 
spect to an undirected network topology. Different from the 
above works, we propose a relational GP whose correlation 
structure exploits the geodesic distance between segments 
based on the topology of a directed road network with ver- 
tices denoting road segments and edges indicating adjacent 
segments weighted by dissimilarity of their features, hence 
tightly integrating the features and relational information. 

Data fusion. The observations are gathered distributedly 
by each sensor along its path in the road network and have 



to be assimilated in order to predict the traffic phenomenon. 
Since a large number of observations are expected to be 
collected, a centralized approach to GP prediction cannot 
be performed in real time due to its cubic time complexity. 

To resolve this, we propose a decentralized data fusion ap- 
proach to efficient and scalable approximate GP predic- 
tion (Section [3}. Existing decentralized and distributed 
Bayesian filtering frameworks for addressing non-traffic re- 
lated proble ms ([Chung et al. [[20041 |Coates| |2004[ |Q lfati 
|Saber| |2005[ [Rosencra ntz et al. \ |2003[ |Sukkarieh et al. 
|2003| ) will face the same difficulties as their centralized 
counterparts described above if applied to predicting traf- 
fic phenomena, thus resulting in loss of predictive perfor- 
mance. Distributed regression algorithms ( jGuestrin et al. 



2004 ; Pas kin and Guestri n 2004 ) for static sensor networks 
gain efficiency from spatial locality assumptions, which 
cannot be exploited by mobile sensors whose paths are 
not constrained by locality. [Cortes (2009) proposed a dis- 
tributed data fusion approach to approximate GP predic- 
tion based on an iterative Jacobi overrelaxation algorithm, 
which incurs some critical limitations: (a) the past obser- 
vations taken along the sensors' paths are assumed to be 
uncorrected, which greatly undermines its predictive per- 
formance when they are in fact correlated and/or the cur- 
rent observations are sparse; (b) when the number of sen- 
sors grows large, it converges very slowly; (c) it assumes 
that the range of positive correlation has to be bounded by 
some factor of the communication range. Our proposed de- 
centralized algorithm does not suffer from these limitations 
and can be computed exactly with efficient time bounds. 

Active sensing. The sensors have to coordinate to ac- 
tively gather the most informative observations for mini- 
mizing the uncertainty of modeling and predicting the traf- 
fic phenomenon. Existing centralized ( Low et aL \ [2008 



2009a| [20TT) and decentralized ( |Low et al.\ |2012| [Strain 



ders et al. 2009[l active sensing algorithms scale poorly 



with a large number of observations and sensors. We pro- 
pose a partially decentralized active sensing algorithm that 
overcomes these issues of scalability (Section|4|i. 

This paper presents a novel Decentralized Data Fusion and 
Active Sensing (D 2 FAS) algorithm (Sections [3] and Q for 
sampling spatiotemporally varying environmental phenom- 
ena with mobile sensors. Note that the decentralized data 
fusion component of D 2 FAS can also be used for static 
and passive mobile sensors. The practical applicability of 
D 2 FAS is not restricted to traffic monitoring; it can be used 
in other environmental sensing applications such as mineral 



prospecting (Low et al. 20071, monitoring of ocean and 



freshwater phenomena (|Dolan et al.\ |2009[ |Podnar et al. 



2010 |Low et al. [[2009b[|20 1 1 1|2012) (e.g., plankton bloom 



anoxic zones), forest ecosystems, pollution (e.g., oil spill), 
or contamination (e.g., radiation leak). The specific contri- 
butions of this paper include: 

• Analyzing the time and communication overheads of 



D 2 FAS (Section^: We prove that D 2 FAS can scale bet- 
ter than existing state-of-the-art centralized algorithms 
with a large number of observations and sensors; 
Theoretically guaranteeing the predictive performance 
of the decentralized data fusion component of D 2 FAS to 
be equivalent to that of a sophisticated centralized sparse 
approximation for the GP model (Section |3j: The com- 
putation of such a sparse approximate GP model can thus 
be parallelized and distributed among the mobile sen- 
sors (in a Google-like MapReduce paradigm), thereby 
achieving efficient and scalable prediction; 
Theoretically guaranteeing the performance of the par- 
tially decentralized active sensing component of D 2 FAS, 
from which various practical environmental conditions 
can be established to improve its performance; 
Developing a relational GP model whose correlation 
structure can exploit both the road segment features and 



road network topology information (Section 2.1 



• Empirically evaluating the predictive performance, time 
efficiency, and scalability of the D 2 FAS algorithm on a 
real-world traffic phenomenon (i.e., speeds of road seg- 
ments) dataset over an urban road network (Section [6j: 
D 2 FAS is more time-efficient and scales significantly 
better with increasing number of observations and sen- 
sors while achieving predictive performance close to that 
of existing state-of-the-art centralized algorithms. 

2 Relational Gaussian Process Regression 

The Gaussian process (GP) can be used to model a spa- 
tiotemporal traffic phenomenon over a road network as fol- 
lows: The traffic phenomenon is defined to vary as a real- 
ization of a GP. Let V be a set of road segments represent- 
ing the domain of the road network such that each road seg- 
ment s £ V is specified by a p-dimensional vector of fea- 
tures and is associated with a realized (random) measure- 
ment z s (Z s ) of the traffic condition such as speed if s is 
observed (unobserved). Let {Z s } se y denote a GP, that is, 
every finite subset of {Z s } se v follows a multivariate Gaus- 
sian distribution (Rasmussen and Will iams 2006). Then, 
the GP is fully specified by its prior mean p s = ¥.[Z S ] and 
covariance a ss > = cov[Z s , Z s i] for all s,s' £ V. In par- 
ticular, we will describe in Section [2~T| how the covariance 
a ss i for modeling the correlation of measurements between 
all pairs of segments s,s' £ V can be designed to exploit 
the road segment features and the road network topology. 

A chief capability of the GP model is that of performing 
probabilistic regression: Given a set D C V of observed 
road segments and a column vector zd of corresponding 
measurements, the joint distribution of the measurements 
at any set Y C V \ D of unobserved road segments re- 
mains Gaussian with the following posterior mean vector 
and covariance matrix 

fJ-Y\D — Mr + ^YD^DuiZD ~ A*£>) (1) 
^YY\D — Syy — Tiyd^jjd^DY (2) 



where [iy (pd) is a column vector with mean components 
p s for all s £ Y (s £ D), T,yd (^dd) is a covariance ma- 
trix with covariance components <j ss i for all s £ Y, s' £ D 
(s, s' £ D), and E^y is the transpose of £yo- The poste- 
rior mean vector py\D <TTT> is used to predict the measure- 
ments at any set Y of unobserved road segments. The pos- 
terior covariance matrix J^yy\d which is independent 
of the measurements zd, can be processed in two ways to 
quantify the uncertainty of these predictions: (a) the trace 
of Eyym yields the sum of posterior variances £ ss m over 
all s £ Y; (b) the determinant of Eyym is used in calcu- 
lating the Gaussian posterior joint entropy 



U[Z Y \Z D ] ± ilog(2^e 



Vy|z?| ■ (3) 

In contrast to the first measure of uncertainty that assumes 
conditional independence between measurements in the set 
Y of unobserved road segments, the entropy-based mea- 
sure |3]) accounts for their correlation, thereby not overesti- 
mating their uncertainty. Hence, we will focus on using the 
entropy-based measure of uncertainty in this paper. 

2.1 Graph-Based Kernel 

If the observations are noisy (i.e., by assuming additive in- 
dependent identically distributed Gaussian noise with vari- 
ance (T 2 ), then their prior covariance a SS ' can be expressed 
as a ss i = k(s, s') + <7 2 <5 S s' where S ss > is a Kronecker delta 
that is 1 if s = s' and otherwise, and k is a kernel function 
measuring the pairwise "similarity" of road segments. For 
a traffic phenomenon (e.g., road speeds), the correlation of 
measurements between pairs of road segments depends not 
only on their features (e.g., length, number of lanes, speed 
limit, direction) but also the road network topology. So, the 
kernel function is defined to exploit both the features and 
topology information, which will be described next. 

Definition 1 (Road Network) Let the road network be 
represented as a weighted directed graph G — (V, E, m) 
that consists of 

• a set V of vertices denoting the domain of all possible 
road segments, 

• a set E C V x V of edges where there is an edge (s, s') 
from s £ V to s' £ V iff the end of segment s connects 
to the start of segment s' in the road network, and 

• a weight function m : E — > R + measuring the stan- 
dardized Manhattan distance (Borg and Groenen. \2005\ 



m(( S , S ')) = E? 



[s']i|/ r i of each edge (s,s') 



where [s]i ( [s']i) is the i-th component of the feature vec- 
tor specifying road segment s ( s'), and Ti is the range 
of the i-th feature. The weight function m serves as a 
dissimilarity measure between adjacent road segments. 

The next step is to compute the shortest path distance 
d(s, s') between all pairs of road segments s, s' £ V (i.e., 
using Floyd- Warshall or Johnson's algorithm) with respect 
to the topology of the weighted directed graph G. Such a 
distance function is again a measure of dissimilarity, rather 
than one of similarity, as required by a kernel function. Fur- 



thermore, a valid GP kernel needs to be positive semidef- 
inite and symmetric (Scholkopf and Smola 2002), which 
are clearly violated by d. 

To construct a valid GP kernel from d, multi-dimensional 



scaling ( Borg and Groenen 2005 1 is applied to embed 



the domain of road segments into the p' -dimensional Eu- 
clidean space M. p . Specifically, a mapping g : V — > 
M. p is determined by minimizing the squared loss g* = 
a,Tgmm g J2 s ^ eV (d(s,s')~\\g(s)-g{s')\\) 2 . With a small 
squared loss, the Euclidean distance ||g*(s) — <?*(s')|| be- 
tween g*(s) and g*(s') is expected to closely approximate 
the shortest path distance d(s, s') between any pair of road 
segments s and s' . After embedding into Euclidean space, 
a conventional kernel function such as the squared expo- 
nential one (Rasmussen and Williams , 2006) can be used: 



fc(s, s) = o\ exp -- 



p 

i=l 



[9*{s)]i-[9*{s')] 



where [g*(s)\i ([g*(s'))i) is the z-th component of the p'- 
dimensional vector g* (s) (g* (s')), and the hyperparameters 
cr s , Ixi • ■ ■ ,l p i are, respectively, signal variance and length- 
scales that can be learned using maximum likelihood es- 



timation ( Rasmussen and Williams 2006). The resulting 
kernel function ^]is guaranteed to be valid. 

2.2 Subset of Data Approximation 

Although the GP is an effective predictive model, it faces a 
practical limitation of cubic time complexity in the number 
\D\ of observations; this can be observed from computing 
the posterior distribution (i.e., ([TJi and |2|i), which requires 
inverting covariance matrix J^dd an d incurs 0(|D| 3 ) time. 
If \D\ is expected to be large, GP prediction cannot be per- 
formed in real time. For practical usage, we have to resort 
to computationally cheaper approximate GP prediction. 

A simple method of approximation is to select only a sub- 
set U of the entire set D of observed road segments (i.e., 
U C D) to compute the posterior distribution of the mea- 
surements at any set 7 C V \ D of unobserved road seg- 
ments. Such a subset of data (SoD) approximation method 
produces the following predictive Gaussian distribution, 
which closely resembles that of the full GP model (i.e., by 
simply replacing D in ([T]i and Q with U): 

Hy\u = + ^yu^uu( z u - fJ-u) (4) 

^YY\U = ^YY — ^YU^UU^UY ■ (5) 

Notice that the covariance matrix S^j/ to be inverted only 
incurs 0(|[/| 3 ) time, which is independent of \D\. 

The predictive performance of SoD approximation is sensi- 
tive to the selection of subset U . In practice, random subset 
selection often yields poor performance. This issue can be 
resolved by actively selecting an informative subset U in 
an iterative greedy manner: Firstly, U is initialized to be 



'For spatiotemporal traffic modeling, the kernel function k can 
be extended to account for the temporal dimension. 



an empty set. Then, all road segments in D \ U are scored 
based on a criterion that can be chosen from, for example, 
the works of (Krause et ai , 2008b"{ [Lawrence et al.\ [2003 ; 
Seeger and Williams 2003 1. The highest-scored segment 
is selected for inclusion in U and removed from D. This 
greedy selection procedure is iterated until U reaches a pre- 
defined size. Among the various criteria introduced earlier, 
the differential entropy score ( Lawrence et a/.|[2003| ) is re- 
ported to perform well (Oh et al. 2010| ); it is a monotonic 
function of the posterior variance S ss |jy (|5j), thus resulting 
in the greedy selection of a segment s <E D\U with the 
largest variance in each iteration. 

3 Decentralized Data Fusion 

In the previous section, two centralized data fusion ap- 
proaches to exact (i.e., ([TJ) and Q) and approximate (i.e., 
Q and (|5])) GP prediction are introduced. In this section, 
we will discuss the decentralized data fusion component 
of our D 2 FAS algorithm, which distributes the computa- 
tional load among the mobile sensors to achieve efficient 
and scalable approximate GP prediction. 

The intuition of our decentralized data fusion algorithm is 
as follows: Each of the K mobile sensors constructs a local 
summary of the observations taken along its own path in the 
road network and communicates its local summary to ev- 
ery other sensor. Then, it assimilates the local summaries 
received from the other sensors into a globally consistent 
summary, which is exploited for predicting the traffic phe- 
nomenon as well as active sensing. This intuition will be 
formally realized and described in the paragraphs below. 

While exploring the road network, each mobile sensor sum- 
marizes its local observations taken along its path based on 
a common support set U C V known to all the other sen- 
sors. Its local summary is defined as follows: 

Definition 2 (Local Summary) Given a common support 
set U C V known to all K mobile sensors, a set C 
V of observed road segments and a column vector zo k of 
corresponding measurements local to mobile sensor k, its 
local summary is defined as a tuple (iy, Sy-y) where 



-1 

D k D k \U\ 
1 



— ^UD k ^ 

is defined in a similar manner to (|5|). 



(6) 
(7) 



such that Y, DkDk \ U 
Remark. Unlike SoD (Section 2.2 1, the support set U of 



road segments does not have to be observed, since the lo- 
cal summary (i.e., ([6]) and |7]i) is independent of the cor- 
responding measurements zjj. So, U does not need to be 



a subset of D = |Jfe=i ^k- To select an informative sup- 



port set U from the set V of all possible segments in the 
road network, an offline active selection procedure similar 



to that in the last paragraph of Section 2.2 can be performed 
just once prior to observing data to determine U. In con- 
trast, SoD has to perform online active selection every time 
new road segments are being observed. 



By communicating its local summary to every other sensor, 
each mobile sensor can then construct a globally consistent 
summary from the received local summaries: 

Definition 3 (Global Summary) Given a common sup- 
port set U C V known to all K mobile sensors and 
the local summary (i^E^) of every mobile sensor 
k = 1, . . . , K, the global summary is defined as a tuple 
(%, XW) where K 

(8) 



Map Reduce paradigm (Chu et al. 2007 1, thereby improv 



fc=i 



K 



■'UU 



k=l 



k 

UU 



(9) 



Remark. In this paper, we assume all-to-all communica- 
tion between the K mobile sensors. Supposing this is not 
possible and each sensor can only communicate locally 
with its neighbors, the summation structure of the global 
summary (specifically, ([8]l and makes it amenable to 



be constructed using distributed consensus filters (Olfati 



Saber, 2005). We omit these details since they are beyond 
the scope of this paper. 

Finally, the global summary is exploited by each mobile 
sensor to compute a globally consistent predictive Gaussian 
distribution, as detailed in Theorem[TJ\ below, as well as to 
perform decentralized active sensing (Section|4]): 

Theorem 1 Let a common support set U C V be known to 
all K mobile sensors. 

A. Given the global summary (zu,tluu)> each mo- 
bile sensor computes a globally consistent predictive 
Gaussian distribution Af(~p Y , £yy) of the measure- 
ments at any set Y of unobserved road segments where 

^YU^uu' Z U 



fly = p,Y + 



Yyy — Syy — Y^yui^uU ~ ^Uu)^UY 



(10) 

(11) 



B. Let jV(^p, T,y T y^ D ) be the predictive Gaussian dis- 
tribution computed by the centralized sparse par- 
tially independent training conditional (PITC) ap- 
proximation of GP model \Quinonero -Candel a and\ 
\Rasmu ssen, 2005) where 

Myjo = Mr + T YD (T DD + A)-\z D - fi D ) (12) 

^£yy -Tyd{T DD +K)- 1 T DY (13) 

such that 

Tbb' — ^BU^UU^UB' (14) 
and A is a block-diagonal matrix constructed from 
the K diagonal blocks of "£>dd\u> eacn of which is 
a matrix ^D k D k \u f or & = 1, ■ ■ ■ ,K where D = 



y^PITC 

n YY\D 



Ufc=i D k- Then, fi Y = /i™£> and £yy = ^ V YY \ D - 



The proof of Theorem [Tj3 is given in Appendix [A] The 
equivalence result of Theorem[T]B bears two implications: 

Remark 1. The computation of PITC can be parallelized 
and distributed among the mobile sensors in a Google-like 



ing the time efficiency of prediction: Each of the K map- 
pers (sensors) is tasked to compute its local summary while 
the reducer (any sensor) sums these local summaries into a 
global summary, which is then used to compute the pre- 
dictive Gaussian distribution. Supposing \Y\ < \U\ for 
simplicity, the 0(\D\((\D\ / K) 2 + \U\ 2 )) time incurred by 
PITC can be reduced to 0((\D\/K) 3 + |[/| 3 + \U\ 2 K) 
time of running our decentralized algorithm on each of the 
K sensors, the latter of which scales better with increasing 
number \D\ of observations. 

Remark 2. We can draw insights from PITC to elucidate an 
underlying property of our decentralized algorithm: It is as- 
sumed that Zd k , Zy are conditionally indepen- 
dent given the measurements at the support set U of road 
segments. To potentially reduce the degree of violation of 
this assumption, an informative support set U is actively 
selected, as described earlier in this section. Furthermore, 
the experimental results on real-world urban road network 
data^j (Section ^ show that D 2 FAS can achieve predic- 
tive performance comparable to that of the full GP model 
while enjoying significantly lower computational cost, thus 
demonstrating the practicality of such an assumption for 
predicting traffic phenomena. The predictive performance 
of D 2 FAS can be improved by increasing the size of U at 
the expense of greater time and communication overhead. 

4 Decentralized Active Sensing 

The problem of active sensing with K mobile sensors is 
formulated as follows: Given the set D y. C V of observed 
road segments and the currently traversed road segment 
Sfc G V of every mobile sensor k = 1, . . . , K, the mo- 
bile sensors have to coordinate to select the most informa- 
tive walks w*, . . . , w* K of length (i.e., number of road seg- 
ments) L each and with respective origins s\, . . . , sk in the 
road network G: 



{w l ,...,w K y- 



- argmax 

(w 1 ,....w K ) 



I \ K Y 



(15) 



where Y Wk denotes the set of unobserved road segments 
induced by the walk Wk- To simplify notation, let a joint 
walk be denoted by w — (wi, . . . , wk ) (similarly, for w*) 
and its induced set of unobserved road segments be Y. w = 
UfcLi from now on. Interestingly, it can be shown 
using the chain rule for entropy that these maximum- 
entropy walks w* minimize the posterior joint entropy (i.e., 
W[Zv\m U y„+ )\Zd u y„» ]) °f tne measurements at the re- 
maining unobserved segments (i.e., V \ (D [j Y w * )) in the 
road network. After executing the walk wjj, each mobile 
sensor k observes the set Y w * of road segments and updates 
its local information: 



Di 



[jY w 



z D k 



Z D k \JY w 



, Sk terminus of 
(16) 



z Quinonero-Candela and Rasmussen (2005) only illustrated 
the predictive performance of PITC on a simulated toy example. 



Evaluating the Gaussian posterior entropy term in ( fl"5j ) in- 
volves computing a posterior covariance matrix ppusing 
one of the data fusion methods described earlier: If |2]) of 
full GP model (Section |5) or Q of SoD (Section [2~2l is to 
be used, then the observations that are gathered distribut- 
edly by the sensors have to be fully communicated to a 
central data fusion center. In contrast, our decentralized 
data fusion algorithm (Section [3j only requires communi- 
cating local summaries (Definition [2]i to compute (Hi for 
solving the active sensing problem I 



w 



- argmax 

w 

1 




(17) 



I[Z y J^-log(2^e) 



in, I 



Without imposing any structural assumption 



(18) 
solving the 

active sensing problem ( [17} will be prohibitively expensive 
due to the space of possible joint walks w that grows expo- 
nentially in the number K of mobile sensors. To overcome 
this scalability issue for D 2 FAS, our key idea is to construct 
a block-diagonal matrix whose log-determinant closely ap- 
proximates that of 'Ey w y w (ll 1 and exploit the property that 



the log-determinant of such a block-diagonal matrix can be 
decomposed into a sum of log-determinants of its diagonal 
blocks, each of which depends only on the walks of a dis- 
joint subset of the K mobile sensors. Consequently, the ac- 
tive sensing problem can be partially decentralized, leading 
to a reduced space of possible joint walks to be searched, 
as detailed in the rest of this section. 

Firstly, we extend an earlier assumption in Section [3] 
Zd 1 , . . . , Zd k , Zy mi , ■ ■ ■ , Zy w are conditionally inde- 
pendent given the measurements at the support set U of 
road segments. Then, it can be shown via the equiva 
lence to PITC (Theorem [l^) that Z Yw y, 
diagonal blocks of the form Y 1 y VJk Y Wk for fc 
and off-diagonal blocks of the form T,y w t/E^^S 
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comprises 



UU^UY wy 

for fc, fc' = 1, . . . , K and fc ^ fc'. In particular, each off- 
diagonal block of T,y w Y w represents the correlation of mea- 
surements between the unobserved road segments Y Wk and 
Y Wk , along the respective walks Wk of sensor fc and ww 
of sensor fc'. If the correlation between some pair of their 
possible walks is high enough, then their walks have to be 
coordinated. This is formally realized by the following co- 
ordination graph over the K sensors: 

Definition 4 (Coordination Graph) Define the coordina- 
tion graph to be an undirected graph Q = (V, £) that com- 
prises 

• a set V of vertices denoting the K mobile sensors, and 

• a set £ of edges denoting coordination dependencies be- 
tween sensors such that there exists an edge {fc, fc'} in- 
cident with sensors fc € V and fc' G V \ {fc} iff 

(19) 



max 



> e 



for a predefined constant e > where Wk denotes the set 
of possible walks of length L of mobile sensor kfrom ori- 
gin Sk in the road network G and Yw k = U Wk <zw k Y Wh - 



Remark. The construction of Q can be decentralized 
as follows: Since Yjju is symmetric and positive def- 
inite, it can be decomposed by Cholesky factorization 
into S[/(7 = where ^ is a lower triangular matrix 

and ^ T is the transpose of VP. Then, S S [/E^^I][/ S ' = 
(ty\T, Us ) T ^\Z Us < where ^\B denotes the column vec 



tor (f) solving \&</> — B. That is, EsyEy^Euv (19 1 can 



be expressed as a dot product of two vectors ^\Sy s and 
^\I][/ S '; this property is exploited to determine adjacency 
between sensors in a decentralized manner: 



Definition 5 (Adjacency) Let 



(20) 

for k — 1, . . . , K. A sensor k G V is adjacent to sensor 
k € V \ {fc} in coordination graph Q iff 

aT. 



max 



> e . 



(21) 



It follows from the above definition that if each sensor fc 
constructs $fc and exchanges it with every other sensor, 
then it can determine its adjacency to all the other sensors 
and store this information in a column vector at- of length 
K with its fc'-th component being defined as follows: 

{1 if sensor fc is adjacent to sensor fc', 
otherwise. 

By exchanging its adjacency vector with every other 
sensor, each sensor can construct a globally consistent ad- 
jacency matrix Ag = (a\ . . . clk) to represent coordination 
graph g. 

Next, by computing the connected components (say, JC of 
them) of coordination graph Q, their resulting vertex sets 
partition the set V of K sensors into K, disjoint subsets 
Vi , . . . , Vic sucn tnat the sensors within each subset have 
to coordinate their walks. Each sensor can determine its 
residing connected component in a decentralized way by 
performing a depth-first search in Q starting from it as root. 

Finally, construct a block-diagonal matrix T,y w Y w to 
comprise diagonal blocks of the form Sy rav y„ v for 



(w k )k 



and Y, 



t»v„ 



n = 1 , . . . , JC where u;y„ 

Ufcev Y Wk . The active sensing problem (|17|) is then ap- 
proximated by 



1 



S\Y W \ 



J Y W Y W 



max - log(27re)' 
w 2 

K 

= max log(27re)l Y ™ 

K 

= max log(27re)l y ' 1 ' 



(23) 



which can be solved in a partially decentralized manner by 
each disjoint subset V„ of mobile sensors: 

argmax log(27re)l F ™ v ™ I 



w v„ 



£f,„ y,„ 



(24) 



Our active sensing algorithm becomes fully decentralized 
if e is set to be sufficiently large: more sensors become iso- 
lated in Q, consequently decreasing the size n = max \V n \ 



of its largest connected component to 1. As shown in Sec- 
tion [5T| decreasing k improves its time efficiency. On the 
other hand, it tends to a centralized behavior (Ffy by set- 
ting e — >• + : Q becomes near-complete, thus resulting in 

k — y K . 



Let 



i 



max 



(25) 



In the result be- 



ande = 0.5 log (K 15 L 25 k^)' 

low, we prove that the joint walk w = (tuvi > ■ ■ ■ i ) is 
guaranteed to achieve an entropy H[Zy ffi ] (i.e., by plugging 



w into ( 18 1) that is not more than e from the maximum en- 
tropy H[Zy m „ ] achieved by joint walk w* ( 17 1: 



:tion l3l 

'§ & L7 

itnevery sei 



Theorem 2 (Performance Guarantee) IfK 15 L 2 - 5 n^e < 
1, f/zen W[Z Yw .] - H[Zy ffl ] < e. 

The proof of Theorem|2]is given in Appendix [B] The im- 
plication of Theorem |2jis that our partially decentralized 
active sensing algorithm can perform comparatively well 
(i.e., small e) under the following favorable environmental 
conditions: (a) the network of K sensors is not large, (b) 
length L of each sensor's walk to be optimized is not long, 
(c) the largest subset of k sensors being formed to coordi- 
nate their walks (i.e., largest connected component in Q) is 
reasonably small, and (d) the minimum required correlation 
e between walks of adjacent sensors is kept low. 

Algorithm [T] below outlines the key operations of our 
D 2 FAS algorithm to be run on each mobile sensor k, as 
detailed previously in Sections [3]and|4j 

Algorithm 1: D 2 FAS(C/, K, L, k, D kl z Dk , s k ) 

while true do 

/ * Data fusion (Sect! 
Construct local summary by 1 
Exchange local summary witneverysensor i ^ k 
Construct global summary by |8j & J?J 

Predict measurements at unobserved road segments by j 1 0| & |l l| 
/* Active Se nsi ng (Section 
Construct <I> fc by pp} 
Exchange <£> with every sensor i ^ k 
Compute adjacency vector a*, by |2l| & |22| 
Exchange adjacency vector with every sensor i ^ k 
Construct adjacency matrix of coordination graph 
Find vertex set V n of its residing connected com pone nt 
Compute maximum-entropy joint walk wv n by (241 
Execute walk and observe its road segments 
Update local information Dk, Zn h , and s& by |l6| 



5 Time and Communication Overheads 

In this section, the time and communication overheads of 
our D 2 FAS algorithm are analyzed and compared to that of 
centralized active sensing ( fF7] i coupled with the data fusion 
methods: Full GP (FGP) and SoD (Section^. 

5.1 Time Complexity 

The data fusion component of D 2 FAS involves com- 
puting the local and global summaries and the predic- 
tive Gaussian distribution. To construct the local sum- 
mary using ^ and Q, each sensor has to evaluate 



Z Dk n k \u in ^(l^l 3 + \U\{\D\/Kf) time and invert it in 
0((\D\/K) 3 ) time, after which the local summary is ob- 
tained in 0(\U\ 2 \D\/K + \U\(\D\/K) 2 ) time. The global 
summary is computed in 0{\U\ 2 K^j by (|8j and Fi- 
nally, the predictive Gaussian distribution is derived in 
0(\U\ 3 + \ U\\Y\ 2 ) time using (To) and O. Supposing 
\Y\ < \U\ for simplicity, the time complexity of data fu- 
sion is then 0({\D\/Kf + \U\ 3 + \U\ 2 K). 

Let the maximum out-degree of G be denoted by 6. 
Then, each sensor has to consider A = 6 L possible 
walks of length L. The active sensing component of 
D 2 FAS involves computing $ fc in 0(AL\U\ 2 ) time, a k 
in 0[A 2 L 2 \U\K) time, its residing connected compo- 
nent in C(« 2 ) time, and the maximum-entropy joint walk 
by (JTTJ and ( |24| with the following incurred time: The 
largest connected component of k sensors in Q has to con- 
sider A K possible joint walks. Note that ^y wv y wv = 

^S\(^Y mk Y Wk \u)keV n ) + ^Y wv jj^ u 1 u ^UY WVn where 
diag(_B) constructs a diagonal matrix by placing vector 
B on its diagonal. By exploiting <!>£., the diagonal and 
latter matrix terms for all possible joint walks can be 
computed in 0(kA(L\U\ 2 + L 2 \U\)) and 0(k 2 A 2 L 2 \U\) 
time, respectively. For each joint walk Wy n , evalu- 
ating the determinant of £y ro y ro incurs C((kL) 3 ) 
time. Therefore, the time complexity of active sensing is 
0(nAL\U\ 2 + A 2 L 2 \U\(K + n 2 ) + A k (kL) 3 ). 

Hence, the time complexity of our D 2 FAS algorithm is 
0{(\D\/K) 3 + \U\ 2 (\U\ +K + kAL) + A 2 L 2 \U\{K + 
k 2 ) + A k (kL) 3 ). In contrast, the time incurred by 
centralized active sensing coupled with FGP and SoD 
are, respectively, 0(\D\ 3 + A K KL(\D\ 2 + (KL) 2 )) and 
0(\U\ 3 \D\+ A K KL(\U\ 2 + (KL) 2 )). It can be ob- 
served that D 2 FAS can scale better with large \D\ (i.e., 
number of observations) and K (i.e., number of sensors). 
The scalability of D 2 FAS vs. FGP and SoD will be further 
evaluated empirically in Section|6] 

5.2 Communication Complexity 

Let the communication overhead be defined as the size of 
each broadcast message. Recall from the data fusion com- 
ponent of D 2 FAS in Algorithm [T] that, in each iteration, 
each sensor broadcasts a C(|[/| 2 ) -sized summary encapsu- 
lating its local observations, which is robust against com- 
munication failure. In contrast, FGP and SoD require each 
sensor to broadcast, in each iteration, a 0(|D|/i4T)-sized 
message comprising exactly its local observations to han- 
dle communication failure. If the number of local obser- 
vations grows to be larger in size than a local summary of 
predefined size, then the data fusion component of D 2 FAS 
is more scalable than FGP and SoD in terms of communica- 
tion overhead. For the partially decentralized active sensing 
component of D 2 FAS, each sensor broadcasts 0(AL\U\)- 
sized $fe and C(A")-sized a k messages. 



Experiments and Discussion 

This 

I 
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section evaluates 
the predictive perfor- 
mance, time efficiency, 
and scalability of our 
D 2 FAS algorithm on a 
real-world traffic phe- 
nomenon (i.e., speeds 
(km/h) of road seg- 
ments) over an urban 
road network (top figure) in Tampines area, Singapore 
during evening peak hours on April 20, 2011. It comprises 
775 road segments including highways, arterials, slip 
roads, etc. The mean speed is 48.8 km/h and the standard 
deviation is 20.5 km/h. 

The performance of D 2 FAS is compared to that of central- 



If 



ized active sensing ( 17 i coupled with the state-of-the-art 
data fusion methods: full GP (FGP) and SoD (Section^. A 
network of K mobile sensors is tasked to explore the road 
network to gather a total of up to 960 observations. To re- 
duce computational time, each sensor repeatedly computes 
and executes maximum-entropy walks of length L = 2 
(instead of computing a very long walk), unless otherwise 
stated. For D 2 FAS and SoD, \U\ is set to 64 . For the active 
sensing component of D 2 FAS, e is set to 0.1, unless other- 
wise stated. The experiments are run on a Linux PC with 
Intel® Core™2 Quad CPU Q9550 at 2.83 GHz. 

6.1 Performance Metrics 

The first metric evaluates the predictive performance of a 
tested algorithm: It measures the root mean squared er- 
ror (RMSE) 



| 1 J2 se v ( z s ~ Ms) over the entire do- 
main V of the road network that is incurred by the predic- 
tive mean fl s of the tested algorithm, specifically, using ([I]) 
of FGP, Q of SoD, or ([To} of D 2 FAS. The second met- 
ric evaluates the time efficiency and scalability of a tested 
algorithm by measuring its incurred time; for D 2 FAS, the 
maximum of the time incurred by all subsets Vi, ■ • • , Vjc of 
sensors is recorded. 

6.2 Results and Analysis 

Predictive performance and time efficiency. Fig.[T]shows 
results of the performance of the tested algorithms averaged 
over 40 randomly generated starting sensor locations with 
varying number K = 4, 6, 8 of sensors. It can be observed 
that D 2 FAS is significantly more time-efficient and scales 
better with increasing number \D\ of observations (Figs.[Tjl 
to[lJ) while achieving predictive performance close to that 
of centralized active sensing coupled with FGP and SoD 
(Figs.[TJi to [!}:). Specifically, D 2 FAS is about 1, 2, 4 orders 
of magnitude faster than centralized active sensing coupled 
with FGP and SoD for K = 4, 6, 8 sensors, respectively. 

Scalability of D 2 FAS. Using the same results as that in 
Fig. [T] Fig.|2]plots them differently to reveal the scalability 
of the tested algorithms with increasing number K of sen- 




i yv-wwvw\ 



ID! i ii' observations 



(d) K = 4 (e) K = 6 (f) K -- 

Figure 1 : Graphs of (a-c) predictive performance and (d-f) 
time efficiency vs. total no. \D\ of observations gathered 
by varying number K of mobile sensors. 
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(f) SoD 



(e) FGP 

Figure 2: Graphs of (a-c) predictive performance and (d-f) 
time efficiency vs. total no. |Z?| of observations gathered 
by varying number K of mobile sensors. 

sors. Additionally, we provide results of the performance 
of D 2 FAS for K = 10, 20, 30 sensors; such results are not 
available for centralized active sensing coupled with FGP 
and SoD due to extremely long incurred time. It can be ob- 
served from Figs. [2^ to|2j; that the predictive performance 
of all tested algorithms improve with a larger number of 
sensors because each sensor needs to execute fewer walks 
and its performance is therefore less adversely affected by 
its myopic selection (i.e., L = 2) of maximum-entropy 
walks. As a result, more informative unobserved road seg- 
ments are explored. 

As shown in Fig. |2jl, when the randomly placed sensors 
gather their initial observations (i.e., \D\ < 400), the time 
incurred by D 2 FAS is higher for greater K due to larger 
subsets of sensors being formed to coordinate their walks 
(i.e., larger n). As more observations are gathered (i.e., 
\D\ > 400), its partially decentralized active sensing com- 
ponent directs the sensors to explore further apart from 
each other in order to maximize the entropy of their walks. 
This consequently decreases k, leading to a reduction in 
incurred time. Furthermore, as K increases from 4 to 20, 
the incurred time decreases due to its decentralized data fu- 
sion component that can distribute the computational load 
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Figure 3: Graphs of time efficiency vs. total no. \D\ of 
observations gathered by varying number K of sensors. 

among a greater number of sensors. When the road net- 
work becomes more crowded from K = 20 to K = 30 
sensors, the incurred time increases slightly due to slightly 
larger k. In contrast, Figs. [2^ and ^ show that the time 
taken by FGP and SoD increases significantly primarily 
due to their centralized active sensing incurring exponential 
time in K. Hence, the scalability of our D 2 FAS algorithm 
in the number of sensors allows the deployment of a larger- 
scale mobile sensor network (i.e., K > 10) to achieve more 
accurate traffic modeling and prediction (Figs. [2^ to|2};). 

Scalability of data fusion. Fig. [3] shows results of the 
scalability of the tested data fusion methods with increas- 
ing number K of sensors. In order to produce meaning- 
ful results for fair comparison, the same active sensing 
component has to be coupled with the data fusion meth- 
ods and its incurred time kept to a minimum. As such, 
we impose the use of a fully decentralized active sens- 
ing component to be performed by each mobile sensor k: 
w* k = argmax H[Zy u , \Z D ]. For D 2 FAS, this corre- 



sponds exactly to p4| ) by setting a large enough e (in our 
experiments, e = 2) to yield k = 1; consequently, compu- 
tational and communicational operations pertaining to the 
coordination graph can be omitted. 

It can be seen from Fig. [3^ that the time incurred by the 
decentralized data fusion component of D 2 FAS decreases 
with increasing K, as explained previously. In contrast, the 
time incurred by FGP and SoD increases (Fig.[3|3 and[3};): 
As discussed above, a larger number of sensors results in 
a greater quantity of more informative unique observations 
to be gathered (i.e., fewer repeated observations), which 
increases the time needed for data fusion. When K > 10, 
D 2 FAS is at least 1 order of magnitude faster than FGP 
and SoD. It can also be observed that D 2 FAS scales better 
with increasing number of observations. So, the real-time 
performance and scalability of D 2 FAS's decentralized data 
fusion enable it to be used for persistent large-scale traffic 
modeling and prediction where a large number of obser- 
vations and sensors (including static and passive ones) are 
expected to be available. 

Varying length L of walk. Fig. [4] shows results of the 
performance of the tested algorithms with varying length 
L = 2, 4, 6, 8 of maximum-entropy joint walks; we choose 
to experiment with just 2 sensors since Figs.[2]and[3]reveal 
that a smaller number of sensors produce poorer predictive 
performance and higher incurred time with large number of 
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Figure 4: Graphs of (a-c) predictive performance and (d-f) 
time efficiency vs. total no. \D\ of observations gathered 
by 2 mobile sensors with varying length L of maximum- 
entropy joint walks. 

observations for D 2 FAS. It can be observed that the predic- 
tive performance of all tested algorithms improve with in- 
creasing walk length L because the selection of maximum- 
entropy joint walks is less myopic. The time incurred by 
D 2 FAS increases due to larger k but grows more slowly 
and is lower than that incurred by centralized active sens- 
ing coupled with FGP and SoD. Specifically, when L = 8, 
D 2 FAS is at least 1 order of magnitude faster (i.e., aver- 
age of 60 s) than centralized active sensing coupled with 
SoD (i.e., average of > 732 s) and FGP (i.e., not available 
due to excessive incurred time). Also, notice from Figs. [2^ 
and[2jl that if a large number of sensors (i.e., K = 30) 
is available, D 2 FAS can select shorter walks of L = 2 to 
be significantly more time-efficient (i.e., average of > 3 
orders of magnitude faster) while achieving predictive per- 
formance comparable to that of SoD with L = 8 and FGP 
with L = 6. 

7 Conclusion 

This paper describes a decentralized data fusion and active 
sensing algorithm for modeling and predicting spatiotem- 
poral traffic phenomena with mobile sensors. Analytical 
and empirical results have shown that our D 2 FAS algo- 
rithm is significantly more time-efficient and scales better 
with increasing number of observations and sensors while 
achieving predictive performance close to that of state-of- 
the-art centralized active sensing coupled with FGP and 
SoD. Hence, D 2 FAS is practical for deployment in a large- 
scale mobile sensor network to achieve persistent and ac- 
curate traffic modeling and prediction. For our future work, 
we will assume that each sensor can only communicate lo- 
cally with its neighbors (instead of assuming all-to-all com- 
munication) and develop a distributed data fusion approach 
to efficient and scalable approximate GP prediction based 
on D 2 FAS and consensus filters dOlfati-Saberl |2005b. 
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A Proof of Theorem [Tfe 

We have to first simplify the Tyd (J^dd , 
the expressions of p Y T \D ' ^ 1 ^yy|£> {j^ 

(r^ + A)- 1 

= {^DU^ul/^UD + A) 
. _ 1 . _1 



A) 1 term in 



— y-'UU I "I 

= A 1 — A 1 YjduY 1uu Y j ud^ 



(26) 

The second equality follows from matrix inversion lemma. 
The last equality is due to 

IS 

t ^UD k ^ D l Dklu ^D k U 



K 

E 

fe=i 

A:=l 



(27) 



Using ((14) and ((26), 

T YD {Tdd+A)- 1 

= Yyu^uu^ud (a 1 — A 1 Sc;7S c/ yS(7£)A 

= J^YU^UU^UD^ 1 



A" 



(28) 



The third equality is due to (27 1. 
From ( fT2) , 

Mr|z> = Py + Tyc (Tdd + A) -1 (zd - ^d) 
= fiY + Yyu^uu^-'UD^ 1 (zd — M-d) 

= flY + ^YU^uu'ZU 

= ~Py ■ 

The second equality is due to ( [28] ). The third 
equality follows from S^cA^^zc — pjj) = 



From ( fl3) . 

yPITC 
Lj YY\D 



£yy — Tyu (Xdd + A) r^y 

Syy — SyyS^^Sj/DA ^^(yE^Sfyy 
= Syy — ^Eyyii^^EyDA ^^fyE^S/yy 

— ^YU^uiJ^UY^j — ^YU^UU^UY 

— Syy — YjYU^uU (pUD^ 1 ^DU ~ ^UU^j ^UU^UY 

— TjYU^'uU^-'UY 
= Y^YY — (pYU^UU^UY — ^YU^UU^UY 

/ 1 " 1 \ 



— Syy — Syf ( Yjj]j ~ £ 



Uu) ^UY 



— Syy . 

The second equality follows from ( fi4) and ( [28 
equality is due to (27 1. 



B Proof of Theorem H 



Let Sy TO y ra = Yjy w y w — ^y w y w and p w be the spectral 
radius of ^Sy^y^ Ey^y^. We have to first bound p w 
from above. 



For any joint walk w 



Sy^y^ comprises diag- 



onal blocks of size | Y WVn | x | Y wv | with components of 
value for n = 1, . . . , /C and off-diagonal blocks of the 

form (Sy y ] Sy y for n, n' — 1.....K. 

and n 7^ n'. We know that any pair of sensors k E V n 
and k' € V n ' reside in different connected components of 
coordination graph Q and are therefore not adjacent. So, by 
Definition |5] 



max 

i.i' 



J ^™,, Y w 



< S 



(29) 



for n,n' = 1, ...,/C and n ^ n'. Using (25 1 



and (29 1, each component in any off-diagonal block of 
^Sy ra y ra ^ ^y„y„ can be bounded as follows: 



max 

iA' 



( >; v ... v v ) 



< \Y WVn fe 



(30) 

for n, n' = 1, . . . , /C and rt 7^ rt'. It follows from ( 30 1 that 



max 

iA' 



^Y„Y m ) £ 



v > < max \Y W \£e<LK£e. 

n 1 n 1 

(31) 

The last inequality is due to max \ Y Wv I < L max |V„| < 
Lk. Then, 



Pw < 



J Y W Y U 



< max 

< KL 2 K{,£ 



(£y ro y ro ) X : 



(32) 



The first two inequalities follow from standard properties 
of matrix norm ( |Golub and Van Loan| |1996| |Stewart and| 



Sun 1990). The last inequality is due to (31 1. 



The rest of this proof utilizes the following result of Ipsen 



and Lee (2003) that is revised to reflect our notations: 
Theorem 3 If \Y w \p 2 w < 1, then log]Ey ro y J < 



log 



J Y W Y^ 



< loglEy^yJ ~ logfl - 1^1^) far any 



joint walk w. 

Using Theorem[3]followed by ( 32 1 



The fifth log 



log I T,Y w Y a I < log 

<log 



1 



l-\Y w \p 2 w 
1 



(33) 



for any joint walk w. 



= \ {{\Y W *\- I^Dlog^TreHloglEy^ 



< 



(|K 10 .|-|I'ffi|)log(27re)+ log 



2 



< - 
~ 2 

< ^ log 

"2 6 1 - (^15X2.5^)2 



|r ffi |)log(27re)+log 
1 



log|Ey e : 
- log|£y ffl 

loglSy y I 



The first equality is due to ( pT8[ >. The first, second, and last 
inequalities follow from Theorem|3] ( |23] >, and p3] l, respec- 
tively. 



